cluster(lb): fix load balancer stuck degraded during bootstrap by mweibel · Pull Request #48 · cloudscale-ch/cluster-api-provider-cloudscale

mweibel · 2026-06-26T14:36:24Z

During initial cluster bootstrap the load balancer could get stuck in degraded or error and never recover on its own. Two things combined to cause this:

The health monitor was created before any control-plane node existed. With a monitor but no healthy members, the LB reports degraded/error.
The reconciler never requeued a non-running LB, so it never observed the LB recovering once the control-plane nodes came up. This was especially the case if it took a bit of time until control plane nodes were available.

Changes:

Requeue every 10s while the LB is pending (degraded/error)
Defer creating the health monitor until the control plane is initialized (at least one control-plane node up).
Move health monitor reconciliation to the end of reconcileNormal, after the load balancer and floating IP, since it depends on both being set.

Also fixes the cloudscaleMachineToCluster watch mapping, which filtered on the wrong label (MachineControlPlaneNameLabel instead of MachineControlPlaneLabel) and so never enqueued on control-plane machine changes.

During initial cluster bootstrap the load balancer could get stuck in `degraded` or `error` and never recover on its own. Two things combined to cause this: - The health monitor was created before any control-plane node existed. With a monitor but no healthy members, the LB reports `degraded`/`error`. - The reconciler never requeued a non-running LB, so it never observed the LB recovering once the control-plane nodes came up. This was especially the case if it took a bit of time until control plane nodes were available. Changes: 1. Requeue every 10s while the LB is pending (`degraded`/`error`) 2. Defer creating the health monitor until the control plane is initialized (at least one control-plane node up). 3. Move health monitor reconciliation to the end of `reconcileNormal`, after the load balancer and floating IP, since it depends on both being set. Also fixes the `cloudscaleMachineToCluster` watch mapping, which filtered on the wrong label (`MachineControlPlaneNameLabel` instead of `MachineControlPlaneLabel`) and so never enqueued on control-plane machine changes.

mweibel temporarily deployed to e2e June 26, 2026 14:46 — with GitHub Actions Inactive

mweibel force-pushed the fix-loadbalancer-readyness branch from a737632 to 944c5e1 Compare June 26, 2026 14:52

mweibel changed the title ~~cluster(lb): improve initial bootstrap behavior~~ cluster(lb): fix load balancer stuck degraded during bootstrap Jun 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cluster(lb): fix load balancer stuck degraded during bootstrap#48

cluster(lb): fix load balancer stuck degraded during bootstrap#48
mweibel wants to merge 1 commit into
mainfrom
fix-loadbalancer-readyness

mweibel commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mweibel commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mweibel commented Jun 26, 2026 •

edited

Loading